ABSTRACT


Uncertainty is a pervasive feature of the 
domains in which expert systems are designed to 
function. Several methods have been used for 
handling uncertainty in expert systems, 
including probability-based methods, heuristics 
such as those implemented in MYCIN, methods 
based on fuzzy set theory and Dempster Shafer 
theory, and various other schemes. This paper 
reviews research designed to test uncertain 
inference methods for accuracy and robustness, 
in accordance with standard engineering 
practice. We have conducted several studies to 
assess how well various methods perform on
problems constructed so that correct answers
are known, and to find out what underlying
features of a problem cause strong or weak 
performance. For each method studied, we have 
identified situations in which performance is 
very good, but also situations in which 
performance deteriorates dramatically. Over a 
broad range of problems, some well-known
methods do only about as well as a simple
linear regression model, and often much worse
than a simple independence probability model.
Our results indicate that some commercially
available expert system shells should be used 
with caution, because the uncertain inference 
models that they implement can yield rather 
inaccurate results. 


INTRODUCTION


Uncertainty is a pervasive feature of many 
domains in which artificially intelligent 
expert systems are intended to function. 
Researchers in artificial intelligence have 
proposed a variety of approaches to uncertain 
reasoning. Some (e.g., 1, 2, 3) have developed
methods that are explicitly based on
probability theory. Other approaches, such as
those used in MYCIN (4, 5), PROSPECTOR (6), and
AI/X (7), use heuristics designed to
approximate probability theory. Yet other
methods involve adaptations of fuzzy set theory
(8), Dempster-Shafer theory (9), and other
ideas not based on probability. Unfortunately,
there is no wide consensus concerning which 
approach is best or even suitable for any 
particular application. 


Some researchers have attempted to compare
these various approaches through theoretical
analysis. For example, Heckerman (10) has
shown that the equations which define MYCIN's
certainty factors can be translated into
probabilistic terms. As another example,
Hunter (11) has investigated conditions under
which probability theory and Dempster-Shafer
theory agree.

Although theoretical analyses can provide 
useful insights, they also become exceedingly 
complex and their usefulness for the average 
practitioner can decrease, particularly when 
heuristics which have no particular theoretical 
justification are being considered. 

Furthermore, these analyses typically focus on 
the formal assumptions of the various 
uncertainty models, which are seldom met in 
practice. Perhaps the most important questions 
concern how models behave when assumptions are 
not met. 

The present authors and various additional 
coauthors have taken a different, empirical 
approach to examining the accuracy of uncertain 
inference models in a series of studies. We 
started working independently, but eventually
realized the commonalities in our work and 
began to collaborate. 

It should be made clear that we are examining 
the basic inference models used by systems such 
as MYCIN. We are not evaluating any particular
implementation that uses any given model. In
our general approach, answers provided by
probability theory are used as a norm against 
which the accuracy of other uncertain inference 
models may be measured. These studies differ 
in details, but all use the same basic research 
paradigm. First, example inference networks
are constructed so that all relevant parameters
are known. Next, new values are assigned to
the evidence nodes, as though additional
information in the form of updated estimates is 
being supplied by a user during a consultation 
session. Conclusion node certainty values are 
calculated which reflect the new information 
according to the model under consideration. 
Finally, these answers are compared to results 
obtained from a probability-based method which
provides the minimum cross-entropy solution
(12). This approach parallels methods used in
various scientific and engineering disciplines,
such as sensitivity analysis and "Monte Carlo"
simulations, for investigating the behavior of
complex systems when assumptions are violated.
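
The probability norm can be illustrated for the simplest net (two evidence nodes, one conclusion). The sketch below is not taken from the studies reviewed. It assumes binary nodes, a strictly positive prior joint distribution over the evidence, and that the only new information is a revised marginal probability for each evidence node. Under those assumptions the minimum cross-entropy solution leaves P(C|E1,E2) unchanged and adjusts the joint distribution of the evidence by iterative proportional fitting. All function and argument names are hypothetical.

```python
def min_cross_entropy_update(prior_joint, cond_c, new_e1, new_e2, iters=200):
    """Return P'(C) after the marginals of E1 and E2 are revised.

    prior_joint[(e1, e2)] = P(E1=e1, E2=e2), with 1 = present, 0 = absent.
    cond_c[(e1, e2)]      = P(C | E1=e1, E2=e2), held fixed by the update.
    new_e1, new_e2        = updated P'(E1), P'(E2), strictly between 0 and 1.
    """
    q = dict(prior_joint)
    for _ in range(iters):
        # Rescale so that the E1 marginal matches its updated value.
        m1 = q[(1, 0)] + q[(1, 1)]
        for e1, e2 in q:
            q[(e1, e2)] *= (new_e1 / m1) if e1 else ((1 - new_e1) / (1 - m1))
        # Rescale so that the E2 marginal matches its updated value.
        m2 = q[(0, 1)] + q[(1, 1)]
        for e1, e2 in q:
            q[(e1, e2)] *= (new_e2 / m2) if e2 else ((1 - new_e2) / (1 - m2))
    # Propagate to the conclusion through the unchanged conditionals.
    return sum(q[k] * cond_c[k] for k in q)


# Hypothetical net with associated evidence (illustrative numbers only).
prior = {(0, 0): 0.5, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.3}
cond = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.9}
print(min_cross_entropy_update(prior, cond, new_e1=0.8, new_e2=0.3))
```

When the updated marginals equal the prior marginals, the function returns the prior P(C), which is a convenient sanity check on any implementation of the norm.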

We have completed studies which evaluate the 
MYCIN model, the PROSPECTOR model,
probability-based models which contain
simplifying assumptions (e.g., independence),
and a simple linear model. The objectives of
the present paper are to review and summarize 
these studies, describe the major objectives 
and findings of each, and discuss the overall 
implications of these findings for expert 
system construction and future research. 

Table I summarizes the studies that will be 
reviewed. All of these studies used the 
general method described above. However, they 
differed in certain important respects as 
well. Some focused on only a single uncertain
inference model, while others looked at several
models simultaneously. Some used many small,
randomly-created inference nets, while others
used larger, selected nets. Finally, some of
these studies derived model parameters by using
published theoretical definitions to translate
the nets directly, while others "tuned" the
parameter values. These issues are all
explained and discussed in more detail below.

STUDIES USING THEORETICAL PARAMETERS 


In our methodology, inference nets are created,
solved by a minimum cross-entropy extension of
probability theory, and also solved by another
uncertain inference model. A key part of this
process involves translating between parameters
suitable for the probability calculations and
parameters required by the other model. For
example, the MYCIN model expresses rule
strengths (relationships between evidence and
conclusions) in measures of belief (MBs) and
measures of disbelief (MDs). The developers of
MYCIN provided theoretical definitions of these
parameters in probability terms. In the first 
three studies shown in Table I, such 
theoretical definitions were used for the 
necessary translations. Consider these 
studies: 


STUDY                 UNCERTAIN             INFERENCE       MODEL
                      INFERENCE MODELS      NETS            PARAMETER
                                                            ESTIMATION

Wise (13)*            MYCIN                 large,          theoretical
Wise & Henrion (14)   probability with      selected        definition
Wise (15)             assumptions

Perrin, Vaughan,      MYCIN                 many, small     theoretical
Yadrick, Holden,                                            definition
& Kempf (16)

Yadrick, Perrin,      PROSPECTOR            many, small     theoretical
Vaughan, Holden,                                            definition
& Kempf (17)

Wise, Perrin,         MYCIN                 many, small     tuned
Vaughan &             PROSPECTOR
Yadrick (18)          probability with
                      assumptions
                      linear regression

Wise (19)             PROSPECTOR            many, small,    tuned
                      probability with      selected
                      assumptions
                      linear regression

* Wise & Henrion (14) and Wise (15) both contain summaries of results
which are presented in more detail in Wise (13).

Table I Evaluation Studies Reviewed
o Wise

Wise (13) presented a detailed theoretical 
analysis of the MYCIN model, as well as several 
other models. He also discussed the
rationale for accepting the minimum
cross-entropy probability solution as an
appropriate criterion for evaluating other
uncertain inference models. Highlights of this
work appear in Wise & Henrion (14), which
presents the methodology and some preliminary
results, and in Wise (15), which summarizes
results for the MYCIN model. 

The MYCIN model (5) was one of the first to be
used for handling uncertainty in an expert
system. It was designed to solve some problems
that the developers believed made probability
theory unsuitable for their application. The
model was also designed to approximate
probability calculations while being modular,
computationally efficient, and more natural for
their subject matter experts to use. The
original concerns the developers had with
probability theory were probably not valid
(e.g., 20). However, the model and several
variants are widely cited and used today,
particularly in several commercial "shells".
Thus, information about the accuracy of the
MYCIN model continues to have practical
relevance for a large community.

Based on his detailed theoretical analyses and 
on critical examples cited in the literature,
Wise constructed some sets of inference nets
with associated rule strengths (defined as
probabilities) for which the MYCIN model was
predicted to be reasonably accurate, and some
for which large errors were predicted. The 
resulting 30 nets ranged in size from the 
simplest (two evidence nodes and one conclusion 
node) to nets with three evidence nodes, 
multiple conclusion nodes and an intermediate 
node level. One net comprised nine evidence 
nodes, four intermediate nodes, and four 
conclusion nodes. Correlations between pieces 
of evidence were also varied systematically 
between strong positive and strong negative 
associations. With two exceptions, rules in 
the nets were conjunctive ("AND") rules. To 
generate test problems, he systematically 
varied "updated" or input evidence 
probabilities over four values for each 
evidence node. This means, for example, that a
net with three evidence nodes yielded 64
(4x4x4) problems. Each problem was solved
using the probability model, the MYCIN model,
and several models based on probability theory
with simplifying assumptions. For each
inference net he computed g, the mean squared 
difference between the maximum entropy 
probability answers and the inference model 
answers across the set of problems. 
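
The problem-generation and scoring steps just described can be sketched as follows. This is an illustration only: the four probability levels are hypothetical placeholders (the values actually used are not given here), and the function names are our own.

```python
from itertools import product


def make_problems(n_evidence, levels=(0.1, 0.4, 0.6, 0.9)):
    """Every combination of updated evidence probabilities.

    The default levels are hypothetical; only their count (four) follows the text.
    """
    return list(product(levels, repeat=n_evidence))


def mean_squared_difference(norm_answers, model_answers):
    """Mean squared difference between the probability norm and a model's answers."""
    pairs = list(zip(norm_answers, model_answers))
    return sum((p - m) ** 2 for p, m in pairs) / len(pairs)


problems = make_problems(3)
print(len(problems))  # 4 x 4 x 4 combinations for a net with three evidence nodes
```

The same helper covers the later studies, which vary two evidence nodes over five values each, by passing five levels and `n_evidence=2`.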

The MYCIN model was most accurate for cases in 
which there was very little difference between 
the base rate (prior probability) of the 
conclusion and the conditional probability of 
the conclusion when both pieces of evidence
were false (or absent). For example, the value
of g was .0004 in one such case and .0005 in
another. Conversely, MYCIN was most inaccurate 
when there was a large difference between the 
conclusion base rate and the conditional 
probability of the conclusion when both pieces 
of evidence were false. The value of g was 
.03, .09, .03, and .04 in four such cases. 

These results are attributable to two features 
of the MYCIN model. First, based on 
theoretical definitions, MYCIN ignores negative 
evidence. That is, if the updated probability 
for a piece of evidence is greater than the 
prior probability (base rate) for that 
evidence, MYCIN updates conclusion 
probabilities associated with the evidence. 
However, if the updated probability for a piece 
of evidence is below the base rate, MYCIN 
ignores that information and conclusion 
probabilities are not updated. The other 
feature concerns the method used to combine
evidence for conjunctive (AND) rules. This
method "pays attention to" only one of the
pieces of evidence involved. As a consequence
of these features, MYCIN provides accurate
answers when the impact of ignoring negative
evidence is minimized, i.e., when the
conditional probability of the conclusion is 
high given that the evidence is absent. 
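
The two features can be made concrete with a simplified sketch of a single conjunctive rule. This is our own reading for illustration, not the translation used in the studies: it assumes the measure of belief is defined as (P(C|E) - P(C)) / (1 - P(C)), that the certainty of each piece of evidence is its movement away from its base rate, that a conjunction takes the minimum certainty, and that a rule with non-positive premise certainty leaves the conclusion unchanged. All names are hypothetical.

```python
def certainty(updated, prior):
    """Certainty of a proposition whose probability moved from prior to updated (assumed form)."""
    if updated >= prior:
        return (updated - prior) / (1 - prior)
    return (updated - prior) / prior


def mycin_conjunctive_update(p_c, p_c_given_both, priors, updates):
    """Updated P'(C) for one AND rule under the simplifying assumptions above."""
    # Rule strength: measure of belief in C given that all the evidence is present.
    mb = max(p_c_given_both - p_c, 0.0) / (1 - p_c)
    # Conjunction attends only to the weakest piece of evidence.
    cf_premise = min(certainty(u, p) for u, p in zip(updates, priors))
    if cf_premise <= 0:
        return p_c  # evidence at or below its base rate is ignored
    return p_c + mb * cf_premise * (1 - p_c)


# Both pieces of evidence rise above their base rates: the conclusion is updated.
print(mycin_conjunctive_update(0.2, 0.8, priors=(0.5, 0.5), updates=(0.9, 0.7)))
# One piece of evidence falls below its base rate: the conclusion stays at 0.2.
print(mycin_conjunctive_update(0.2, 0.8, priors=(0.5, 0.5), updates=(0.9, 0.3)))
```

In the second call the probability norm would generally lower P'(C), so the gap between the two answers grows with the difference between the conclusion base rate and the conditional probability of the conclusion when evidence is absent.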

The robustness of an uncertain inference model 
can be assessed by examining reasons for its 
worst performance. In this light, Wise
compared the MYCIN model to a simple 
probability model which assumes conditional 
independence. Across the set of nets he 
studied, the conditional probability model was 
considerably more robust (largest g = .04) than 
the MYCIN model (largest g = .09) . The MYCIN 
model was very accurate on some nets, but very
inaccurate on other nets. 

o Perrin, et al. 

Perrin, Vaughan, Yadrick, Holden, and Kempf
(16) also studied the MYCIN model. However,
the inference nets were constructed in a
somewhat different way. In this study, only
the simplest sorts of nets were studied, i.e., 
those comprising two pieces of evidence and one 
conclusion. These networks are the basic 
building blocks of larger networks; inference 
in these nets requires both evidence combining 
and propagation. In this study, many nets were 
constructed by random sampling from the 
universe of three-node nets. In particular,
200 nets were compiled in which the pieces of
evidence were independent and 200 were compiled
in which the pieces of evidence were
statistically associated. Problems were
generated by independently varying the updated
evidence probabilities over five values. Since
all nets had two evidence nodes, this created
25 problems for each net. As in (13), each
problem was also solved using the minimum
cross-entropy probability model. Next, each net
was translated into MYCIN parameters using the
theoretical definitions, and was solved using
all three of MYCIN's combining functions
(conjunctive, disjunctive, and incremental).
Networks were classified according to which
combining function provided the lowest error
for the network. The incremental function was
the most accurate for about 60% of the
networks, the conjunctive function was the most
accurate for about 35%, and the disjunctive
function was most accurate for the remaining
5%. The mean absolute error across all nets
was about .07, while the average maximum error
per net was about .22. Further analysis
indicated that much of this error was due to
MYCIN's ignoring negative evidence. For only
those problems in which MYCIN updated conclusion
probabilities, the mean error across problems
and nets was about .02, and the average maximum
error was about .05. Further analysis
indicated that MYCIN error was greatest in
these problems when evidence base rates were
low and evidence-conclusion associations were
strong. These attributes characterize the
difficult diagnostic process; the results
suggest that MYCIN will be least accurate in
precisely the situations for which expert
systems are likely to be most valuable.

o Yadrick, et al. 

Yadrick, Perrin, Vaughan, Holden, & Kempf (17)
studied the model used in the PROSPECTOR system
(6). Like the MYCIN model, the PROSPECTOR
model was developed to address perceived
problems with probability theory for expert
system applications. It was also intended to
approximate probability calculations while
being computationally efficient and modular.
While this model has received less attention
than the MYCIN model, it and several variants
(e.g., AI/X) have also been implemented in
commercially-available expert system shells.

Yadrick, et al. used the same inference net and
problem generation methods as Perrin, et al.
All networks contained two evidence nodes and
one conclusion node. A total of 400 networks
were sampled which contained independent
evidence and 400 nets were sampled which
contained associated evidence. Again, 25
problems were generated for each net and solved
using maximum entropy probability
calculations. The problems were translated
into PROSPECTOR parameters using theoretical
definitions and the problems were solved using
PROSPECTOR conjunctive, disjunctive, and
independent rule combining functions. For each
net, the mean squared error was computed and 
the maximum error for a single problem was 
recorded. 

PROSPECTOR error was quite large (often greater 
than .5) for many nets. Extremely large errors 
were found mainly for nets in which the 
probability of the conclusion was high if one 
piece of evidence was true and one was false, 
but was not as high if both pieces of evidence
were either true or false. We concluded that
the PROSPECTOR model is fundamentally incapable
of handling these "counterintuitive" nets, and
excluded them from further analysis. This left
66 independent and 73 associated evidence nets
for additional consideration.


The independence combining function was most
accurate for about 90% of the remaining
independent nets and about 80% of the remaining
associated nets. The overall average error was
about .014 for independent nets and about .022
for associated nets; overall maximum error was
about .055 for independent nets and about .083
for associated nets. Further analysis
indicated that error was greatest when the
evidence is most strongly associated with the
conclusion. Moreover, the error can be
compounded or mitigated by the values of
updated evidence probabilities. In summary,
the PROSPECTOR model was quite accurate for
some problems and networks, but very inaccurate
for others over a wide range of new evidence 
probabilities. Like MYCIN, it appears to be 
least accurate in the typical situations to 
which it would likely be applied. 


TUNED MODEL PARAMETERS 


The studies described above used published 
formal definitions to translate between 
probability model parameters and uncertainty 
model parameters. The hope was to determine 
the absolute degree of error and provide a 
theoretical explanation for sources of error 
produced by the uncertainty models. Despite 
some success in this, practical applications of
the findings are limited, precisely because we
used formal uncertainty model parameter
definitions. These theoretical definitions
have little relevance to knowledge engineers
building real expert systems, because 
parameters are typically estimated by experts 
based on an intuitive rather than a formal 
understanding. Then the parameters are 
"tuned", or adjusted interactively by the 
experts and knowledge engineers to obtain the 
most accurate results on the data used for 
system development. The relationship between 
parameters estimated in this way, the formal 
definitions of the parameters, and probability 
theory is not clear. Furthermore, the tuning 
process may correct some or all of the errors
observed in the studies described above.

This tuning issue led us to do two additional
studies (18, 19). The objective was to study
the errors made by uncertain inference models
empirically after their parameters have been
tuned. As before, sample networks were
created, and problems were run by 
systematically varying updated evidence 
probabilities. Problem solutions produced by 
uncertain inference models were compared to the 
same minimum cross-entropy probability norm. 
This time, however, the model parameters were 
optimized for each net ("tuned") so that the 
model's answers were as close to the 
probability answers as possible, on the 
average. These solutions, therefore, represent 
the best performance that could be achieved by 
each model. 
o Wise, et al.


Four different inference methods were examined
by Wise, Perrin, Vaughan, & Yadrick (18).

These included the MYCIN and PROSPECTOR models 
as before. We also included a linear 
regression model, given by the following 
equation: 


P'(C) = a + b1*P'(E1) + b2*P'(E2).    (1)


In this equation and below, P'(x) is the
updated probability for the event x, and a, b1,
and b2 are constant parameters, which were
optimized. Linear models have received little
attention from artificial intelligence
researchers, although they have been used
successfully to model a variety of human
judgments (21). We included this model to
provide a baseline against which to compare the
other models. 

Finally, a probability theory-based 
independence model was also included. This 
model is described by the following equation: 


P'(C) = P'(~E1) * P'(~E2) * P(C|~E1&~E2) +
        P'(E1)  * P'(~E2) * P(C|E1&~E2)  +
        P'(~E1) * P'(E2)  * P(C|~E1&E2)  +
        P'(E1)  * P'(E2)  * P(C|E1&E2).    (2)

The model reflects normal probability
calculations under the assumption that the
pieces of evidence are independent. The four
conditional probabilities are the model
parameters (which were optimized). After
parameter optimization, this model is
equivalent to a linear regression model with an
interaction term. 
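
Equation 2 and its regression form can be written out directly. The sketch below is illustrative, with hypothetical names and parameter values. Writing x = P'(E1), y = P'(E2), and c00, c10, c01, c11 for the four conditional probabilities, expanding equation 2 gives P'(C) = c00 + (c10 - c00)x + (c01 - c00)y + (c11 - c10 - c01 + c00)xy, which is the linear model of equation 1 plus an interaction term.

```python
def independence_model(p_e1, p_e2, cond):
    """Equation 2: cond[(e1, e2)] = P(C | E1=e1, E2=e2), with 1 = present, 0 = absent."""
    total = 0.0
    for e1 in (0, 1):
        for e2 in (0, 1):
            w1 = p_e1 if e1 else 1 - p_e1
            w2 = p_e2 if e2 else 1 - p_e2
            total += w1 * w2 * cond[(e1, e2)]
    return total


def interaction_form(p_e1, p_e2, cond):
    """The same model written as a regression with an interaction term."""
    a = cond[(0, 0)]
    b1 = cond[(1, 0)] - cond[(0, 0)]
    b2 = cond[(0, 1)] - cond[(0, 0)]
    b3 = cond[(1, 1)] - cond[(1, 0)] - cond[(0, 1)] + cond[(0, 0)]
    return a + b1 * p_e1 + b2 * p_e2 + b3 * p_e1 * p_e2


# Hypothetical conditional probabilities, for illustration only.
cond = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.9}
print(independence_model(0.5, 0.2, cond), interaction_form(0.5, 0.2, cond))
```

Both functions return the same value for any pair of updated evidence probabilities, up to rounding.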

A total of 109 two-evidence, one-conclusion 
networks were sampled using procedures similar 
to those of (16) and (17) . For each piece of 
evidence, the updated probabilities varied over 
five values so that 25 problems were run for 
each network. For each model, parameter values 
were obtained which minimized, across the 25 
problems, the sum of squared differences 
between the model solutions and the minimum 
cross-entropy answers. This optimization was 
done using a deflected gradient search 
algorithm (22) with appropriate precautions to 
avoid local minima and round-off error 
problems. Table II summarizes the performance 
of the four models. 
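
For the linear model of equation 1, the tuning step can be sketched without a search procedure, because the sum of squared differences has a closed-form minimum. The code below therefore substitutes ordinary least squares (normal equations solved by Gauss-Jordan elimination) for the deflected gradient search used in (18). The five probability levels and the target values are hypothetical, as are all names.

```python
def fit_linear_model(problems, targets):
    """Least-squares a, b1, b2 for P'(C) = a + b1*P'(E1) + b2*P'(E2)."""
    rows = [(1.0, x1, x2) for x1, x2 in problems]
    # Normal equations (X^T X) beta = X^T y, stored as an augmented 3x4 matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           + [sum(r[i] * t for r, t in zip(rows, targets))]
           for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return tuple(aug[i][3] / aug[i][i] for i in range(3))


def rmse(problems, targets, params):
    """Root mean squared error of the tuned linear model over one net's problems."""
    a, b1, b2 = params
    sq = [(a + b1 * x1 + b2 * x2 - t) ** 2 for (x1, x2), t in zip(problems, targets)]
    return (sum(sq) / len(sq)) ** 0.5


levels = (0.1, 0.3, 0.5, 0.7, 0.9)  # hypothetical updated-probability values
problems = [(x1, x2) for x1 in levels for x2 in levels]  # 25 problems per net
# Hypothetical norm answers containing an interaction the linear model cannot fit.
targets = [0.1 + 0.4 * x1 + 0.3 * x2 + 0.1 * x1 * x2 for x1, x2 in problems]
params = fit_linear_model(problems, targets)
print(params, rmse(problems, targets, params))
```

Models whose answers are not linear in their parameters, such as MYCIN and PROSPECTOR, would still need an iterative search of the kind cited in the text.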

The main finding was that the MYCIN (5
parameters), PROSPECTOR (7 parameters), and
linear (3 parameters) models performed equally
well (for all practical purposes), while the
independence (4 parameters) model was
significantly more accurate (according to an
analysis of variance test). Furthermore, the
errors for the MYCIN, PROSPECTOR, and linear
models were highly correlated (Pearson
product-moment coefficient >.95). This shows


INFERENCE       AVERAGE    HIGH     LOW
METHOD          RMSE       RMSE     RMSE

MYCIN           .048       .152     .001
PROSPECTOR      .047       .148     .001
Linear eq.      .048       .152     .001
Independence    .006       .036     .000

Note: This table was taken from (18).
RMSE is root mean squared error.

Table II Tuned Parameter Errors


that the models all performed well or poorly on 
the same problems. They were behaving almost 
identically for the networks studied here, 
although the linear model requires estimation 
of fewer parameters. A probability 
theory-based independence model performed 
better and required fewer parameters than MYCIN 
or PROSPECTOR. 

o Wise 

The objective of this study (19) was to
determine the degree to which errors of the
sort shown in Table II can be attributed to
assumption violations in the networks. The
study included the PROSPECTOR model, the linear
and marginal independence models (equations 1
and 2), and a model that was linear on
logarithms of odds ratios (i.e., it substituted
logs of odds ratios for probabilities in
equation 1). The general methodology,
including network and problem generation and
model parameter optimization, was the same as
in (18). Here, however, all networks were
constructed to meet the PROSPECTOR model's
conditional independence assumptions. Thus,
all error for PROSPECTOR in these networks must
be due to the approximate updating functions.

Table III summarizes results for the 
conditional independence networks. In this 
table, errors are expressed in terms of a 


                INDEPENDENCE   PROSPECTOR   LOG-ODDS

Mean                .90           .52         -.36
Standardized
error               .08           .35          .07

Note: This table summarized from (X).
Cell entries are standardized error measure
(see text).

Table III Errors for Conditionally
Independent Networks


standardized measure, where 1.0 reflects no 
error and 0.0 reflects the same level of error 
as the linear model. Positive scores on this 
measure indicate better performance than the 
linear model, and negative scores indicate 
worse performance than the linear model. As 
may be seen, the PROSPECTOR model performs
better than the linear model when networks
meet the model's assumptions. However, due to
the updating procedure, the model still
performs worse than the independence model and
still makes substantial errors.
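
The text fixes only the anchor points of the standardized measure (1.0 for no error, 0.0 for the linear model's error, negative values for worse performance). One formula consistent with those anchors is sketched below. The linear scaling between the anchors and the function name are assumptions on our part, and the choice of underlying error statistic is left open.

```python
def standardized_score(model_error, linear_error):
    """1.0 = no error, 0.0 = same error as the linear model, below 0 = worse.

    Assumes a linear scale between the two anchor points given in the text.
    """
    return 1.0 - model_error / linear_error


print(standardized_score(0.0, 0.05))   # no error
print(standardized_score(0.05, 0.05))  # same error as the linear model
print(standardized_score(0.10, 0.05))  # worse than the linear model
```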


DISCUSSION 


We think that these results have a number of
implications for expert system construction.

First, it is clear that both the MYCIN and
PROSPECTOR models are suitably accurate under
some circumstances, but can make large errors
under other circumstances. Exactly which model
and what parameter values are used make a
potentially important difference in the overall
accuracy of an expert system. It may not be
possible to tune the system to perform with
reliable accuracy across a broad range of
problems, users, and solutions. In short,
under some circumstances one should probably
not use the MYCIN or PROSPECTOR models. This
conclusion is important, since these models are
embedded in many commercial shells and are
widely used. Indeed, neither these nor other
models should be used uncritically, without
investigations to determine appropriateness to 
the particular application under consideration. 

Second, very simple models may work well for 
many problems. A simple linear model worked as 
well as the MYCIN and PROSPECTOR models, and a 
probability-based independence model worked 
much better. Elaborate models have been 
developed to handle uncertainty in expert 
systems, but the elaborations add little to
accuracy and are very sensitive to differences
in, for example, evidence-conclusion
relationships.

All of the uncertain inference models made
substantial errors under some circumstances.
This suggests that for some difficult
applications, custom-built uncertain inference
models may still be required. The system
builder should select or develop a method that
is neither too simple nor too complex for the
application at hand. 

When an uncertain inference model is being 
considered, one need not focus entirely on the 
assumptions of the model and whether those 
assumptions are met in the application. We 
have found that some models work well even when
assumptions are not met (e.g. , the 
probability-based independence model and the 
linear model) and that others may work poorly 
even if assumptions are met. We believe that 
robustness is more important than theoretical 
elegance in practical expert system building. 

Finally, we believe that the empirical approach 
to evaluating uncertain inference model 
accuracy and the general methodology we have 
developed are useful. The findings summarized
above have shed new light on the performance of
such models, which goes beyond theoretical
analyses. However, many questions remain
unanswered. Our studies have looked at only a
few models and only at simple networks. While
it seems likely to us that errors will tend to
propagate and compound in many large networks,
and that other heuristic models will perform
poorly
in many circumstances, these issues should be 
settled empirically. We are presently 
investigating these and other issues. 